{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.6",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "colab": {
      "provenance": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "POWZoSJR6XzK"
      },
      "source": [
        "# Embeddings SQL custom functions\n",
        "\n",
        "txtai 4.0 added support for SQL-based embeddings queries. This feature combines natural language queries for similarity with concrete filtering rules. txtai now has support for user-defined SQL functions, making this feature even more powerful."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qa_PPKVX6XzN"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
        "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
        "trusted": true,
        "_kg_hide-output": true,
        "id": "24q-1n5i6XzQ"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Create index\n",
        "Let's first recap how to create an index. We'll use the classic txtai example.\n"
      ],
      "metadata": {
        "id": "0p3WCDniUths"
      }
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
        "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
        "trusted": true,
        "id": "2j_CFGDR6Xzp",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "f2488a78-6cae-4c25-985e-fb2dd674a534"
      },
      "source": [
        "from txtai.embeddings import Embeddings\n",
        "\n",
        "data = [\"US tops 5 million confirmed virus cases\",\n",
        "        \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
        "        \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
        "        \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
        "        \"Maine man wins $1M from $25 lottery ticket\",\n",
        "        \"Make huge profits without work, earn up to $100,000 a day\"]\n",
        "\n",
        "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "# Run a search\n",
        "embeddings.search(\"feel good story\", 1)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '4',\n",
              "  'score': 0.08329004049301147,\n",
              "  'text': 'Maine man wins $1M from $25 lottery ticket'}]"
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Custom SQL functions\n",
        "\n",
        "Next, we'll recreate the index adding user-defined SQL functions. These functions are simply Python callable objects or functions that take an input and return values. Pipelines, workflows, custom tasks and any other callable object is supported."
      ],
      "metadata": {
        "id": "QTee7YMNDD4R"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def clength(text):\n",
        "  return len(text) if text else 0\n",
        "\n",
        "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True, \"functions\": [clength]})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "# Run a search using a custom SQL function\n",
        "embeddings.search(\"select clength(text) clength, length(text) length, text from txtai where similar('feel good story')\", 1)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "rbsEXtysDDNg",
        "outputId": "f966be17-086b-49b4-e1af-62b766f1c995"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'clength': 42,\n",
              "  'length': 42,\n",
              "  'text': 'Maine man wins $1M from $25 lottery ticket'}]"
            ]
          },
          "metadata": {},
          "execution_count": 15
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The function itself is simple, it's just alternate length function. But this example is just warming us up to what is possible and what is more exciting. "
      ],
      "metadata": {
        "id": "epIV58P1DyZa"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Pipelines in SQL\n",
        "\n",
        "As mentioned above, any callable can be registered as a custom SQL function. Let's add a translate SQL function."
      ],
      "metadata": {
        "id": "1Iw1WKR6FW3S"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.pipeline import Translation\n",
        "\n",
        "# Translation pipeline\n",
        "translate = Translation()\n",
        "\n",
        "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True, \"functions\": [translate]})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "query = \"\"\"\n",
        "select\n",
        "  text,\n",
        "  translation(text, 'de', null) 'text (DE)',\n",
        "  translation(text, 'es', null) 'text (ES)',\n",
        "  translation(text, 'fr', null) 'text (FR)'\n",
        "from txtai where similar('feel good story')\n",
        "limit 1\n",
        "\"\"\"\n",
        "\n",
        "# Run a search using a custom SQL function\n",
        "embeddings.search(query)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "83e8yXpXFh4F",
        "outputId": "0b17e9be-8983-418d-9903-b1e72efc5918"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'text': 'Maine man wins $1M from $25 lottery ticket',\n",
              "  'text (DE)': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket',\n",
              "  'text (ES)': 'Maine hombre gana $1M de billete de lotería de $25',\n",
              "  'text (FR)': 'Maine homme gagne $1M à partir de $25 billet de loterie'}]"
            ]
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "And just like that we have translations through SQL! This is pretty 🔥🔥🔥\n",
        "\n",
        "We can do more to make this easier though. Let's define a helper function to not require as many parameters. The default logic will require all function parameters each call, including parameters with default values."
      ],
      "metadata": {
        "id": "Ck_XTyBEQtaW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def translation(text, lang):\n",
        "  return translate(text, lang)\n",
        "\n",
        "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True, \"functions\": [translation]})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "query = \"\"\"\n",
        "select\n",
        "  text,\n",
        "  translation(text, 'de') 'text (DE)',\n",
        "  translation(text, 'es') 'text (ES)',\n",
        "  translation(text, 'fr') 'text (FR)'\n",
        "from txtai where similar('feel good story')\n",
        "limit 1\n",
        "\"\"\"\n",
        "\n",
        "# Run a search using a custom SQL function\n",
        "embeddings.search(query)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "L2DDJrd0RAaN",
        "outputId": "0bb437ec-5c9b-4a0c-fe8a-07f641c94a49"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'text': 'Maine man wins $1M from $25 lottery ticket',\n",
              "  'text (DE)': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket',\n",
              "  'text (ES)': 'Maine hombre gana $1M de billete de lotería de $25',\n",
              "  'text (FR)': 'Maine homme gagne $1M à partir de $25 billet de loterie'}]"
            ]
          },
          "metadata": {},
          "execution_count": 17
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Custom SQL functions with applications\n",
        "\n",
        "Of course this is all available with YAML-configured applications."
      ],
      "metadata": {
        "id": "mTT8nopiRdVH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "config = \"\"\"\n",
        "translation:\n",
        "\n",
        "writable: true\n",
        "embeddings:\n",
        "  path: sentence-transformers/nli-mpnet-base-v2\n",
        "  content: true\n",
        "  functions:\n",
        "    - {name: translation, argcount: 2, function: translation}\n",
        "\"\"\"\n",
        "\n",
        "from txtai.app import Application\n",
        "\n",
        "# Build application and index data\n",
        "app = Application(config)\n",
        "app.add([{\"id\": x, \"text\": row} for x, row in enumerate(data)])\n",
        "app.index()\n",
        "\n",
        "# Run search with custom SQL\n",
        "app.search(query)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "FZ_7G6M4RUbz",
        "outputId": "4eca94f3-d2aa-4449-dc6f-f1091ad9dd67"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'text': 'Maine man wins $1M from $25 lottery ticket',\n",
              "  'text (DE)': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket',\n",
              "  'text (ES)': 'Maine hombre gana $1M de billete de lotería de $25',\n",
              "  'text (FR)': 'Maine homme gagne $1M à partir de $25 billet de loterie'}]"
            ]
          },
          "metadata": {},
          "execution_count": 18
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aDIF3tYt6X0O"
      },
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook introduced running user-defined custom SQL functions through embeddings SQL. This powerful feature can be used with any callable function including pipelines, tasks and workflows in tandem with similarity and rules filters."
      ]
    }
  ]
}